
Multi-armed Bandits

A k-armed Bandit Problem

In our $k$-armed bandit problem, each of the $k$ actions has an expected or mean reward given that that action is selected; let us call this the value of that action. We denote the action selected on time step $t$ as $A_t$, and the corresponding reward as $R_t$. The value then of an arbitrary action $a$, denoted $q_*(a)$, is the expected reward given that $a$ is selected:

\begin{gather*} q_*(a) \doteq \mathbb{E}[R_t \mid A_t=a]. \end{gather*}

$q_*(a)$ is not known, so we need to estimate it. We denote the estimated value of action $a$ at time step $t$ as $Q_t(a)$. We would like $Q_t(a)$ to be close to $q_*(a)$.

Sample Average Method

\begin{gather*} Q_t(a) \doteq {\text{sum of rewards when } a \text{ taken prior to } t \over \text{number of times } a \text{ taken prior to } t} = {\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i=a} \over \sum_{i=1}^{t-1} \mathbb{1}_{A_i=a}} \end{gather*}

By the law of large numbers, as the number of times $a$ is selected goes to infinity, $Q_t(a)$ converges to $q_*(a)$.
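
As a concrete illustration, here is a minimal Python sketch of the sample-average estimate; the array names `rewards` and `actions` are assumptions for this example, holding the history $R_1,\dots,R_{t-1}$ and $A_1,\dots,A_{t-1}$.

```python
import numpy as np

def sample_average(rewards, actions, a):
    """Sample-average estimate Q_t(a): mean reward over the steps where action a was taken."""
    rewards = np.asarray(rewards)
    actions = np.asarray(actions)
    mask = actions == a                  # indicator 1_{A_i = a}
    if not mask.any():                   # a has never been taken; fall back to a default value
        return 0.0
    return rewards[mask].mean()          # sum of rewards for a / number of times a was taken
```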

Greedy Action

The simplest action selection rule is to select one of the actions with the highest estimated value, that is, one of the greedy actions.

\begin{gather*} A_t \doteq \argmax_a{Q_t(a)}. \end{gather*}
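
In code, greedy selection is just an argmax over the current estimates; a common refinement (assumed here, not stated above) is to break ties among equally valued actions at random:

```python
import numpy as np

def greedy_action(Q, rng=np.random.default_rng()):
    """Return one of the actions with the highest estimated value, breaking ties at random."""
    Q = np.asarray(Q)
    best = np.flatnonzero(Q == Q.max())  # indices of all greedy actions
    return int(rng.choice(best))
```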

$\varepsilon$-Greedy Method

Greedy action selection always exploits current knowledge to maximize immediate reward; it spends no time at all sampling apparently inferior actions to see if they might really be better. A simple alternative is to behave greedily most of the time, but every once in a while, say with small probability $\varepsilon$, instead select randomly from among all the actions with equal probability, independently of the action-value estimates.
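
A minimal sketch of $\varepsilon$-greedy selection ($\varepsilon = 0.1$ is just an illustrative value); the greedy branch is the same tie-breaking argmax as above:

```python
import numpy as np

def epsilon_greedy_action(Q, epsilon=0.1, rng=np.random.default_rng()):
    """With probability epsilon pick a uniformly random action; otherwise act greedily."""
    Q = np.asarray(Q)
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))                   # explore, ignoring the estimates
    return int(rng.choice(np.flatnonzero(Q == Q.max())))   # exploit, ties broken at random
```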

Incremental Implementation

The action-value methods we have discussed so far all estimate action values as sample averages of observed rewards. We now turn to the question of how these averages can be computed in a computationally efficient manner, in particular, with constant memory and constant per-time-step computation.

To simplify notation, we concentrate on a single action. Let $R_i$ denote the reward received after the $i$th selection of this action, and let $Q_n$ denote the estimate of its action value after it has been selected $n-1$ times, which we can now write simply as

\begin{gather*} Q_n \doteq {R_1 + R_2 + \cdots + R_{n-1} \over n-1}. \end{gather*}

Given $Q_n$ and the $n$th reward, $R_n$, the new average of all $n$ rewards can be computed by

\begin{align*} Q_{n+1} & = {1\over n}\sum_{i=1}^n R_i \\ & = {1\over n}\Bigl(R_n + \sum_{i=1}^{n-1} R_i\Bigr) \\ & = {1\over n}\Bigl(R_n + (n-1){1\over n-1}\sum_{i=1}^{n-1} R_i\Bigr) \\ & = {1\over n}\bigl(R_n + (n-1)Q_n\bigr) \\ & = {1\over n}\bigl(R_n + nQ_n - Q_n\bigr) \\ & = Q_n + {1\over n}\bigl[R_n-Q_n\bigr] \end{align*}

The general form is

\begin{gather*} \textit{NewEstimate} \leftarrow \textit{OldEstimate} + \textit{StepSize}\,\bigl[\textit{Target} - \textit{OldEstimate}\bigr]. \end{gather*}

Pseudocode
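
The pseudocode here would put the pieces together: $\varepsilon$-greedy selection plus the incremental sample-average update. Below is a minimal Python sketch of that loop; the Gaussian `bandit` testbed at the end is an assumption for illustration, not part of the algorithm.

```python
import numpy as np

def simple_bandit(bandit, k, steps=1000, epsilon=0.1, rng=np.random.default_rng()):
    """epsilon-greedy action selection with incremental sample-average updates."""
    Q = np.zeros(k)                                       # Q_1(a) = 0 for all a
    N = np.zeros(k, dtype=int)                            # times each action has been selected
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.integers(k)                           # explore
        else:
            a = rng.choice(np.flatnonzero(Q == Q.max()))  # exploit, ties broken at random
        r = bandit(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]   # NewEstimate <- OldEstimate + (1/n)[Target - OldEstimate]
    return Q, N

# Illustrative stationary testbed: 10 arms with fixed true values and unit-variance rewards.
rng = np.random.default_rng(0)
q_star = rng.normal(size=10)
Q, N = simple_bandit(lambda a: rng.normal(q_star[a], 1.0), k=10, rng=rng)
```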

My Code

Coming soon

Tracking a Nonstationary Problem

As noted earlier, we often encounter reinforcement learning problems that are effectively nonstationary. In such cases, it makes sense to give more weight to recent rewards than to long-past rewards.

\begin{align*} Q_{n+1} & \doteq Q_n + \alpha [R_n-Q_n], \qquad \alpha \in (0, 1] \\ & = \alpha R_n + (1-\alpha)Q_n \\ & = \alpha R_n + (1-\alpha)[\alpha R_{n-1}+(1-\alpha)Q_{n-1}] \\ & = \alpha R_n + (1-\alpha)\alpha R_{n-1}+(1-\alpha)^2Q_{n-1} \\ & = \alpha R_n + (1-\alpha)\alpha R_{n-1}+(1-\alpha)^2\alpha R_{n-2}+\cdots +(1-\alpha)^{n-1}\alpha R_1+(1-\alpha)^nQ_1 \\ & = (1-\alpha)^nQ_1+\sum_{i=1}^n \alpha(1-\alpha)^{n-i}R_i. \end{align*}

The first term tells us that the contribution of $Q_1$ decreases exponentially with the number of selections $n$. The second term tells us that rewards further back in time contribute exponentially less to the sum. Taken together, we see that the influence of our initialization of $Q$ goes to zero as more data arrive, while the most recent rewards contribute the most to the current estimate.
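
A minimal sketch of this constant-step-size update, together with a toy nonstationary arm whose true value drifts as a random walk (the drift model and $\alpha = 0.1$ are assumptions for illustration):

```python
import numpy as np

def constant_step_update(Q, r, alpha=0.1):
    """Exponential recency-weighted average: Q <- Q + alpha * (R - Q), alpha in (0, 1]."""
    return Q + alpha * (r - Q)

# Tracking a single nonstationary arm: the estimate follows the drifting true value.
rng = np.random.default_rng(0)
q_true, Q = 0.0, 0.0
for t in range(10_000):
    q_true += rng.normal(0.0, 0.01)                  # the true value drifts over time
    Q = constant_step_update(Q, rng.normal(q_true, 1.0))
```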

Optimistic Initial Values

So far, the initial action-value estimates have been set to 0, which is not necessarily optimistic. If we instead initialize them to a value well above any plausible reward, a greedy learner is systematically "disappointed": whichever action it tries, the received reward falls short of the estimate, so the estimate drops and another action becomes greedy. As a result, it keeps exploring in the early stages even though it always acts greedily. However, this method is not well suited to nonstationary problems, because its drive for exploration is inherently temporary, and we may not know in advance what an appropriately optimistic initial value is.
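
A sketch of the idea under stated assumptions ($Q_1 = 5$ for every action and a constant step size $\alpha = 0.1$, a common illustrative choice when the true values are roughly standard normal):

```python
import numpy as np

def optimistic_greedy(bandit, k, steps=1000, q_init=5.0, alpha=0.1, rng=np.random.default_rng()):
    """Purely greedy selection with optimistic initial estimates and a constant step size."""
    Q = np.full(k, q_init)                            # optimistic Q_1(a)
    for _ in range(steps):
        a = rng.choice(np.flatnonzero(Q == Q.max()))  # greedy, ties broken at random
        Q[a] += alpha * (bandit(a) - Q[a])            # early rewards are 'disappointing'
    return Q
```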

Upper-Confidence-Bound (UCB) Action Selection

Exploration is needed because there is always uncertainty about the accuracy of the action-value estimates. It would be better to select among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates. One effective way of doing this is to select actions according to

\begin{gather*} A_t \doteq \argmax_a{\biggl[Q_t(a)+c\sqrt{\ln t \over N_t(a)}\biggr]}, \end{gather*}

where $N_t(a)$ denotes the number of times that action $a$ has been selected prior to time $t$, and the number $c > 0$ controls the degree of exploration. If $N_t(a) = 0$, then $a$ is considered to be a maximizing action.

The idea is that the square-root term is a measure of the uncertainty or variance in the estimate of $a$'s value. The quantity being maxed over is thus a sort of upper bound on the possible true value of action $a$, with $c$ determining the confidence level. Each time $a$ is selected the uncertainty is presumably reduced: $N_t(a)$ increments and, as it appears in the denominator, the uncertainty term decreases. On the other hand, each time an action other than $a$ is selected, $t$ increases but $N_t(a)$ does not; because $t$ appears in the numerator, the uncertainty estimate increases. The use of the natural logarithm means that the increases get smaller over time, but are unbounded; all actions will eventually be selected, but actions with lower value estimates, or that have already been selected frequently, will be selected with decreasing frequency over time.
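
A minimal sketch of UCB selection ($c = 2$ is an illustrative value); actions with $N_t(a) = 0$ are treated as maximizing, since the bonus is effectively infinite there:

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Select argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ], trying each untried action first."""
    Q = np.asarray(Q, dtype=float)
    N = np.asarray(N, dtype=float)
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])                       # N_t(a) = 0 counts as a maximizing action
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
```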